ABSTRACT

We introduce a novel dictionary optimization method for
high-dimensional vector quantization employed in approximate
nearest neighbor (ANN) search. Vector quantization methods first
seek a series of dictionaries, then approximate each vector by a
sum of elements selected from these dictionaries. An optimal
series of dictionaries should be mutually independent, and each
dictionary should generate a balanced encoding for the target
dataset. Existing methods do not explicitly consider these
properties. To achieve these goals while minimizing the
quantization error (residue), we propose a novel dictionary
optimization method called Dictionary Annealing (DA) that
alternately "heats up" a single dictionary by generating an
intermediate dataset with residual vectors, "cools down" the
dictionary by fitting the intermediate dataset, then extracts the
new residual vectors for the next iteration. DA learns better
codes for ANN search tasks. DA is easily implemented on GPU to
utilize the latest computing technology, and can easily be
extended to an online dictionary learning scheme. We show by
experiments that our optimized dictionaries substantially reduce
the overall quantization error. Used jointly with residual vector
quantization, our optimized dictionaries lead to better
approximate nearest neighbor search performance than the
state-of-the-art methods.

1. INTRODUCTION

Since the seminal work on Product Quantization (PQ), there has
been growing interest in the computer vision community in
applying vector quantization to high-dimensional, large-scale
datasets before any application, to fight the curse of
dimensionality. A typical scenario is the approximate nearest
neighbor (ANN) search task, which is a fundamental problem in
many computer vision applications such as image retrieval and
image recognition. Traditional ANN search methods include
hashing-based methods such as Locality Sensitive Hashing,
Iterative Quantization [17], Spectral Hashing [37], Kernelized
Locality Sensitive Hashing [25], and LDAHash [34]. They transform
an original database vector into a sequence of bits, and then use
Hamming distances between the embedded hash codes to approximate
the distances between vectors. Data structures such as
Hierarchical K-means [29], KD-Tree [15], and R-Tree have also
been proposed to perform ANN tasks.

Figure 1: On each iteration, Dictionary Annealing first picks a
dictionary and generates an intermediate dataset from the residue
and the dictionary, then optimizes the picked dictionary to
better fit the intermediate dataset, and finally quantizes the
dataset to obtain the residue for the next iteration. The figure
is best viewed in color.

Product Quantization is a vector quantization method for nearest
neighbor search. PQ divides the feature space into M disjoint
subspaces of the same dimension, and performs k-means to learn M
dictionaries with K elements per dictionary on these
lower-dimensional subspaces. Each original database vector is
then approximated by the concatenation of M elements, one chosen
from each dictionary. PQ and its variations allow fast distance
computation for efficient ANN search. Given a query vector q, the
distances between q and each element of the dictionaries are
precomputed. The distance from q to any database vector can then
be efficiently approximated with M table lookups. Thus a linear
scan can be accelerated a hundredfold by PQ. Compared to hashing
methods, the search accuracy of PQ is much higher within the same
search time.
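The lookup-table idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy dictionaries, and the use of plain Python lists are assumptions made for illustration, and the dictionaries are given by hand rather than learned with k-means.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(x, codebooks):
    """Pick, in each subspace, the index of the nearest dictionary element."""
    sub = len(x) // len(codebooks)          # dimension of each subspace
    code = []
    for m, book in enumerate(codebooks):
        chunk = x[m * sub:(m + 1) * sub]
        code.append(min(range(len(book)), key=lambda k: sq_dist(chunk, book[k])))
    return code

def adc_tables(q, codebooks):
    """Precompute distances from each query chunk to every dictionary element."""
    sub = len(q) // len(codebooks)
    return [[sq_dist(q[m * sub:(m + 1) * sub], c) for c in book]
            for m, book in enumerate(codebooks)]

def adc_distance(tables, code):
    """Approximate squared distance: M table lookups and M additions."""
    return sum(tables[m][k] for m, k in enumerate(code))

# Toy setting (hypothetical): d = 4, M = 2 subspaces, K = 2 elements each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [2.0, 0.0]],
]
x = [0.9, 1.2, 1.8, 0.1]      # database vector
q = [1.0, 0.0, 0.0, 0.0]      # query vector
code = pq_encode(x, codebooks)
tables = adc_tables(q, codebooks)
approx = adc_distance(tables, code)
```

The tables are built once per query, so scanning N database vectors costs N times M lookups instead of N full d-dimensional distance computations.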

To further improve the performance of PQ, Optimized Product
Quantization (OPQ) [16] and Cartesian k-means (ck-means) [30]
find an optimized rotation for a better subspace partition and
thereby lower the quantization error of PQ. Composite
Quantization [35] and Additive Quantization generalize PQ by
relaxing its constraint that the data space is decomposed into
orthogonal subspaces. Distance-encoded product quantization
extends PQ by encoding both the cluster index and the distance to
the cluster center. However, these methods mainly focus on
relaxing constraints or introducing new parameters to improve PQ.
How to incrementally improve the initially learned dictionaries
so as to further improve vector quantization performance remains
largely unaddressed.

Hashing-based binary embedding methods, for example Spectral
Hashing [37] and Semi-supervised Hashing [36], aim to find an
efficient code in which each bit has a 50% chance of being 1 or
0, and in which different bits are independent of each other.
Similarly, for quantization-based embeddings, which encode an
original vector into several chunks, we aim to find an encoding
in which each chunk takes each of the values 1, ..., K with
probability 1/K, and in which different chunks are independent of
each other. That means that, for each dictionary, elements should
be chosen evenly by database vectors, and that dictionaries
should be mutually independent. Among all dictionaries meeting
these requirements, we seek the ones that minimize the
quantization error.

In this paper we propose a new dictionary optimization method
called Dictionary Annealing (DA), which alternately optimizes a
single dictionary with the residue and re-encodes the dataset to
obtain the latest residue. See Figure 1 for an intuitive
depiction of the DA algorithm. Inspired by simulated annealing,
the main idea of DA is to "heat up" a dictionary to a better
initial position so that we can "cool down" the dictionary with a
smaller residue left. Given a series of dictionaries learned by a
quantization method, say Residual Vector Quantization, on each
iteration DA

1. Sorts the dictionaries by their elements' norm, then uses a
beam search method to fit the dataset with the dictionaries
and obtain the residue;

2. Picks a single dictionary to optimize: first generates an
intermediate dataset as the sum of the residue and the
components of the quantized dataset on this dictionary, then
optimizes this dictionary to better fit the intermediate
dataset.

Similar to the subspace clustering presented in [11], DA
incrementally optimizes a single dictionary via subspaces. DA
first performs k-means on a d'-dimensional subspace (where d'
depends on the information entropy of this dictionary),
initialized by the dictionary elements on this subspace. It then
iteratively adds more dimensions and performs k-means on the
higher-dimensional subspace, initialized by the dictionary
optimized in the previous iteration (with elements padded with
chunks of zeros). This process is repeated until the whole
feature space has been fitted.

Our proposed Dictionary Annealing is closely related to Residual
Vector Quantization (RVQ), which generates mutually independent
dictionaries by directly quantizing the residual vectors.
However, the performance of RVQ is limited by unbalanced
partitions in the later stages of quantization. Nevertheless, the
residual vectors can be used to increase the independence of
dictionaries, and the unbalanced partition problem can be solved
via initialization on a subspace. The empirical results show that
our Dictionary Annealing indeed finds a better encoding. We have
validated our method on two datasets commonly used for evaluating
ANN search performance: SIFT-1M and GIST-1M [21]. The
dictionaries optimized by our method gain a significant
performance boost compared to other, un-optimized
state-of-the-art methods.

In addition, our algorithm can easily be applied to online
dictionary learning. For ANN tasks, the major concern is to speed
up the query process while maintaining high precision and recall;
it is acceptable to spend more time on dictionary learning and
encoding. In our algorithm, the previously learned dictionaries
are not discarded but improved, so online dictionary learning is
done simply by feeding newly arriving data in big batches. Online
dictionary learning for matrix factorization and sparse coding
has been proposed in [28], while our algorithm aims to boost the
performance of ANN tasks. Experiments show that our online
dictionary learning further improves the ANN search quality
substantially, which makes vector quantization methods more
effective for the ever-growing datasets of real-world
applications.

The remainder of this paper is organized as follows. We first
briefly introduce quantization methods for ANN tasks in
Section 2. In Section 3 we briefly discuss what makes a good
encoding for quantization methods, and present observations on
popular quantization methods. In Section 4 we propose our
Dictionary Annealing algorithm. In Section 5 we discuss the
initialization, scalability and implementation of Dictionary
Annealing. Finally, in Section 6 we evaluate our method on ANN
tasks and compare it to other state-of-the-art quantization
methods to demonstrate the superiority of the optimized
dictionaries learned by Dictionary Annealing.


2. QUANTIZATION FOR ANN SEARCH 

The main advantage of quantization methods for approximate
nearest neighbor search is that the Asymmetric Distance
Computation (ADC) introduced in [21] allows fast and accurate
distance approximation. With ADC, we can exhaustively compute the
approximate distance between a query vector q and every vector
x in the dataset X. Quantization methods for ANN search use a
series of M dictionaries $C_m = \{c_m(1), \cdots, c_m(K)\}$,
$m = 1, \cdots, M$, each containing K elements, to approximate a
database vector as the sum of M vectors sequentially chosen from
these dictionaries:

$$x \approx \sum_{m=1}^{M} c_m(i_m(x)),$$

where $i_m(x)$ is the index function of x. Then the Euclidean
distance between an input query q and a database vector x is
approximated by:

$$
\begin{aligned}
\|q - x\|^2 &\approx \Big\|q - \sum_{m=1}^{M} c_m(i_m(x))\Big\|^2 \\
&= \sum_{m=1}^{M} \|q - c_m(i_m(x))\|^2 - (M-1)\|q\|^2 \\
&\quad + \sum_{i=1}^{M} \sum_{j=1, j \neq i}^{M} c_i(i_i(x))^\top c_j(i_j(x))
\end{aligned}
\qquad (1)
$$

For every query q, the distances in the first term are
precomputed before the exhaustive distance computation, the
second term is a constant for all database vectors and can be
omitted, and the third term is precomputed at the database
encoding stage. Thus, the approximate distance between q and a
database vector x can be efficiently computed with M table
lookups and M additions.
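The decomposition in Equation (1) can be checked numerically with a short sketch. The helper names and the toy dictionaries are assumptions for illustration; in a real system the cross term would be stored alongside each database code at encoding time.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cross_term(dicts, code):
    """Third term of Eq. (1): sum over i != j of c_i^T c_j (query independent)."""
    chosen = [dicts[m][k] for m, k in enumerate(code)]
    return sum(dot(ci, cj)
               for i, ci in enumerate(chosen)
               for j, cj in enumerate(chosen) if i != j)

def query_tables(q, dicts):
    """Per-query lookup tables: distance from q to every dictionary element."""
    return [[sq_dist(q, c) for c in book] for book in dicts]

def adc(tables, code, cross, q_norm_sq):
    """Evaluate the right-hand side of Eq. (1) from precomputed quantities."""
    M = len(code)
    first = sum(tables[m][k] for m, k in enumerate(code))
    return first - (M - 1) * q_norm_sq + cross

# Toy setting (hypothetical): M = 3 full-dimensional dictionaries, K = 2, d = 2.
dicts = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.5, 0.5], [-1.0, 0.0]],
    [[0.25, 0.0], [0.0, -0.25]],
]
code = [1, 0, 1]                      # indices i_m(x) of one database vector
q = [0.5, 1.0]

cross = cross_term(dicts, code)       # computed once, at encoding time
approx = adc(query_tables(q, dicts), code, cross, dot(q, q))

# Direct evaluation of ||q - sum_m c_m(i_m(x))||^2 for comparison.
recon = [sum(dicts[m][k][t] for m, k in enumerate(code)) for t in range(2)]
exact = sq_dist(q, recon)
```

For ranking database vectors against a fixed query, the term involving the squared norm of q can be dropped, since it shifts all distances by the same amount.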

Product Quantization generates dictionaries on disjoint
subspaces, so the third term vanishes and need not be computed.
Composite Quantization [35] introduces a constraint on the inner
products between dictionary elements that keeps the third term
constant, so the need for computing it is also eliminated.
Additive Quantization and Residual Vector Quantization require
the third term to be encoded together with the dataset in order
to perform ADC.


3. GOOD ENCODING FOR QUANTIZATION METHODS

For hashing-based approximate nearest neighbor search methods, we
seek a code that requires only a small number of bits to
represent the full dataset while mapping similar items to similar
binary codewords. An efficient code requires that each bit has a
50% chance of being one or zero, and that different bits are
mutually independent. This is usually done by thresholding and
finding optimal orthogonal projections, as in Spectral
Hashing [37], Iterative Quantization [17], Semi-supervised
Hashing [36], etc.

For quantization-based approximate nearest neighbor search
methods, the criterion for an efficient code is essentially the
same as for hashing-based methods. We would like to obtain the
maximum information entropy $S(C_m)$ for every dictionary $C_m$
and no mutual information between different dictionaries:

$$
\begin{aligned}
S(C_m) &= -\sum_{k=1}^{K} p_k^m \log_2 p_k^m = \log_2 K \\
I(C_i, C_j) &= \sum_{k_i=1}^{K} \sum_{k_j=1}^{K} p_{ij}(k_i, k_j)
  \log_2 \frac{p_{ij}(k_i, k_j)}{p_{k_i}^i \, p_{k_j}^j} = 0
\end{aligned}
\qquad (2)
$$

for $i, j \in 1 \cdots M$, $i \neq j$,

where $p_k^m$ denotes the probability that the k-th element of
$C_m$ is chosen, and $p_{ij}(k_i, k_j)$ denotes the probability
that the $k_i$-th element of $C_i$ and the $k_j$-th element of
$C_j$ are chosen simultaneously by a vector x. We present an
illustrative comparison of the encoding quality of different
quantization methods under this criterion in Figure 2. To obtain
balanced partitions, PQ clusters on disjoint subspaces; however,
these subspaces could be correlated. To obtain independent
dictionaries, previous works pre-process the data using simple
heuristics such as randomly ordering the dimensions or randomly
rotating the space. Optimized product quantization and Cartesian
k-means further find an optimal rotation of the original feature
space so that the dimensions are de-correlated.
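Both quantities in the criterion above can be estimated empirically from the indices assigned to a sample of vectors. The following is a minimal sketch under that assumption; the function names are illustrative and the probabilities are simple relative frequencies.

```python
import math
from collections import Counter

def entropy_bits(indices):
    """Empirical entropy (in bits) of the indices chosen from one dictionary."""
    n = len(indices)
    return -sum((c / n) * math.log2(c / n) for c in Counter(indices).values())

def mutual_information_bits(a, b):
    """Empirical mutual information (in bits) between two index sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2((c / n) / ((pa[i] / n) * (pb[j] / n)))
               for (i, j), c in pab.items())

# Toy codes for four vectors and two dictionaries with K = 2 elements each.
codes_1 = [0, 0, 1, 1]
codes_2 = [0, 1, 0, 1]

h1 = entropy_bits(codes_1)                            # balanced: log2(2) = 1 bit
mi_12 = mutual_information_bits(codes_1, codes_2)     # independent: 0 bits
mi_11 = mutual_information_bits(codes_1, codes_1)     # fully dependent: 1 bit
```

A dictionary whose elements are used unevenly yields an entropy below log2 K, and two dictionaries whose choices are correlated yield a positive mutual information.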

Residual vector quantization (RVQ) uses a different approach to
obtain mutually independent dictionaries: it simply learns each
dictionary on the residual left by the previously learned
dictionaries. However, RVQ suffers from less efficient individual
dictionaries, because k-means is not well suited to clustering
high-dimensional data, as depicted in [33]. The k-means algorithm
fails to generate good-quality dictionaries on the residual
spaces; a direct observation is the low information entropy of
the later dictionaries.

The final goal of a good encoding is to lower the quantization
error (the residue):

$$E(C_1, C_2, \cdots, C_M) = \sum_{x \in X} \Big\|x - \sum_{m=1}^{M} c_m(i_m(x))\Big\|^2$$

Given a series of learned dictionaries, even if they do not
encode the dataset very well, they still contain much information
about the structure of the dataset. Dictionary Annealing seeks an
incremental refinement of such dictionaries.

4. DICTIONARY ANNEALING 

The main idea of our proposed Dictionary Annealing is to use
residual vectors to generate an intermediate dataset, i.e.,
"heating up" a dictionary, and then to "cool down" the dictionary
by fitting the intermediate dataset. We have two reasons for
doing so:

• The residual space is largely independent of the other
dictionary spaces, as observed in Figure 2. If a dictionary
fits the residual space well, then it gains much
independence.

• The intermediate dataset is actually part of the original
dataset. So if a dictionary fits the intermediate dataset
better, then the quantization error is also reduced.

DA also manages to find a balanced partition, as explained in the
following text. See Algorithm 1 for brief pseudocode of
Dictionary Annealing.

4.1 Generate and fit the intermediate datasets 

As mentioned above, residual vector quantization generates
largely mutually independent feature spaces, though the
traditional k-means method ends up with poor partitions. In any
case, the residual space is independent of all the dictionaries'
feature spaces. So we add the residue $e_x$ to a dictionary's
recovered dataset to generate an intermediate dataset:

$$X' = \{x' = e_x + c_m(i_m(x)),\; x \in X\}$$

and fit this new space to increase the independence of this
dictionary as well as to decrease the quantization error. If we
find a dictionary that fits the intermediate dataset better, the
quantization error is lowered and the independence of this
dictionary is increased. The problem then becomes how to learn a
balanced partition while lowering the quantization error.
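Constructing the intermediate dataset is mechanical once the codes are known. A minimal sketch, assuming the dataset, dictionaries and codes are plain Python lists (the function names are illustrative):

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def residual(x, dicts, code):
    """e_x = x minus the sum of the selected element of every dictionary."""
    r = list(x)
    for m, k in enumerate(code):
        r = sub(r, dicts[m][k])
    return r

def intermediate_dataset(X, dicts, codes, m):
    """x' = e_x + c_m(i_m(x)) for every x: the residue plus dictionary m's part."""
    return [add(residual(x, dicts, code), dicts[m][code[m]])
            for x, code in zip(X, codes)]

# Toy setting (hypothetical): M = 2 dictionaries, K = 2, d = 2, one data vector.
dicts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.25, -0.25]],
]
X = [[1.5, 0.75]]
codes = [[0, 1]]

X_prime = intermediate_dataset(X, dicts, codes, m=1)
```

Equivalently, each x' is x with the contributions of all the other dictionaries removed, which is why a better fit to X' directly lowers the overall quantization error.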
Figure 2: Mutual information matrix between dictionaries for
different quantization methods. The experiment was conducted on a
subset containing 100K 960-d vectors from the GIST-1M dataset. We
used different methods to learn M = 8 dictionaries with K = 256
elements per dictionary. A perfect encoding should have no mutual
information between different dictionaries and an information
entropy of log2 K = 8 bits for each dictionary. Our proposed
method achieves near-optimal encoding.


Algorithm 1 Dictionary Annealing

Input: Dataset X, dimension d, number of dictionaries M,
initial dictionaries $\{C_m, m \in 1 \cdots M\}$, number of
elements K per dictionary.

Output: Optimized dictionaries: $\{C_m, m \in 1 \cdots M\}$

1: $C'_m = C_m$, $m = 1 \cdots M$
2: repeat
3:   Arrange dictionaries in descending order of norm:
     $\sum \|C_m\|^2 > \sum \|C_{m+1}\|^2$, $m \in 1 \cdots M-1$
4:   Use the beam search encoding method described in
     Section 4.2 to encode X:
     $x = \sum_{m=1}^{M} c_m(i_m(x)) + e_x$
     where $e_x$ is the residue of x.
5:   Randomly pick a dictionary $C_m$, and use the method
     described in Section 4.1.2 to optimize the dictionary to
     better fit the intermediate dataset:
     $X' = \{x' = e_x + c_m(i_m(x)),\; x \in X\}$
     (First seek a $d_1$-dimensional subspace, where
     $d_1 = d \cdot 2^{S(C_m)} / K$, then iteratively pad zeros
     and fit higher-dimensional subspaces.)
6: until Quit Condition


4.1.1 Subspace Clustering 

For the intermediate dataset, we seek a dictionary that minimizes
the residue and also has high information entropy. An interesting
observation on the residue is that it thoroughly stultifies the
k-means algorithm: as illustrated in Figure 2, the information
entropy can even drop below 5 bits on the 7th dictionary of RVQ.

To obtain a better clustering, one popular approach is to cluster
on a lower-dimensional subspace; this is also what PQ and OPQ do
to obtain high information entropy for each dictionary. Various
previously proposed methods for high-dimensional data clustering
seek a clustering in an optimal subspace instead of the whole
feature space. In lower-dimensional subspaces the projected
datasets become denser, so a balanced clustering can be obtained
more easily. Clustering on subspaces can also be more
interpretable, as irrelevant features may exist in
high-dimensional data. Some other approaches, like PROCLUS, use a
special distance function to assign each point to a unique
cluster.

However, when fitting the intermediate dataset, it is not
reasonable to cluster on just a few dimensions, as the residue
lies in the whole feature space. In addition, the distance
function is already determined by the application. We therefore
seek a hybrid way to perform clustering on the intermediate
dataset.

4.1.2 Learning an Entropy Maximized Partition

Table 1: Comparison between different encoding schemes. The
GIST-1M dataset is used as it is very high-dimensional and it is
tougher to obtain a good encoding on it. We randomly picked 1000
samples to perform the encoding experiments. We used dictionaries
(M = 8, K = 256) optimized by DA initialized by RVQ. Our
dictionary annealing (DA) encoding method and the additive
quantization (AQ) encoding method are compared. In addition, we
implemented a "smart" brute force search which runs for hours
encoding the vectors. We also used the iterated conditional modes
algorithm (ICM) to encode the dataset. DA and AQ are GPU
accelerated on an NVIDIA GTX 980 with 4GB of dedicated memory;
however, AQ's encoding scheme cannot fully utilize the GPU
because it has more memory operations. ICM and brute force search
are run on an Intel E5-2697v2 CPU with the latest Intel MKL.

Method         Encoding Time    Quantization Error
DA(L = 1)      0.021s           0.647480
DA(L = 10)     0.075s           0.606554
DA(L = 100)    0.481s           0.596206
AQ(L = 8)      0.203s           0.630149
AQ(L = 16)     0.259s           0.619377
AQ(L = 32)     0.422s           0.608681
ICM [6]        26.504s          0.627182
Brute Force    10000s           0.586213


We aim to optimize a dictionary instead of learning a new
dictionary from scratch, as the previously learned dictionary can
provide better initial points for k-means. How much information
from the dictionary should be used? If the dictionary fits the
intermediate dataset well (that is, it has a high information
entropy), then more information from the dictionary should be
preserved. If the dictionary has a low information entropy, we
should reduce the dimension and initialize k-means on a small
subspace, so that the noisy parts of the dictionary are removed.
Then we gradually adjust the dictionary to fit the whole feature
space to obtain a more effective dictionary.

Here we suggest using $d_1 = d \cdot 2^{S(C_m)} / K$ as the
dimension of the subspace, as it directly measures whether a
dictionary is balanced. We first perform PCA on the intermediate
dataset and extract the component vectors:
$R = \{r_1^\top; r_2^\top; \cdots; r_d^\top\}$. We then perform
k-means on $\{R_1 x'\}$, where
$R_1 = \{r_1^\top; r_2^\top; \cdots; r_{d_1}^\top\}$, and
initialize it with the dimension-reduced dictionary
$\{R_1 c_m(k), k \in 1 \cdots K\}$. Iteratively, the learned
dictionary (padded with chunks of zeros) is used to initialize
k-means on a higher-dimensional dataset $\{R_n x'\}$,
$R_n = \{r_1^\top; \cdots; r_{d_n}^\top\}$, $d_n > d_{n-1}$,
until we have learned the optimized dictionary
$\{R c_m(k), k \in 1 \cdots K\}$ on $\{R x'\}$.
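The dimension bookkeeping of this procedure can be sketched as follows. The starting dimension follows the formula suggested above; the rounding, the clamping to the range 1 to d, and the doubling growth schedule are assumptions made for illustration, since the text only states that more dimensions are added on each iteration. The PCA projection and the k-means runs themselves are omitted.

```python
def initial_subspace_dim(d, entropy_bits, K):
    """d_1 = d * 2^S / K, rounded and clamped to [1, d] (rounding is assumed)."""
    return max(1, min(d, round(d * 2 ** entropy_bits / K)))

def dimension_schedule(d, d1, growth=2):
    """Increasing subspace dimensions d_1 < d_2 < ... ending at d.

    The multiplicative growth factor is a hypothetical choice.
    """
    dims = [d1]
    while dims[-1] < d:
        dims.append(min(d, dims[-1] * growth))
    return dims

def pad_with_zeros(codeword, new_dim):
    """Extend a codeword learned on a subspace to initialize the next one."""
    return list(codeword) + [0.0] * (new_dim - len(codeword))

# A poorly balanced dictionary (5 bits out of a possible 8) on 960-d data
# starts from a 120-d subspace; a perfectly balanced one starts from all 960.
d1_low = initial_subspace_dim(960, 5.0, 256)
d1_full = initial_subspace_dim(960, 8.0, 256)
schedule = dimension_schedule(960, d1_low)
padded = pad_with_zeros([1.0, 2.0], 4)
```

Since 2^S is the effective number of evenly used elements, the ratio 2^S / K lies in (0, 1], so a balanced dictionary is refined in the full space while an unbalanced one is rebuilt from its leading principal components.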

4.2 Optimized Encoding 

Encoding for product quantization is quite simple since the
original feature space has been divided into mutually orthogonal
subspaces. For additive quantization and composite
quantization [35], the encoding problem is NP-hard. Encoding with
the dictionaries optimized by DA is also NP-hard. For any input
vector x, we seek the code that minimizes the quantization
error E:

$$
\begin{aligned}
E &= \Big\|x - \sum_{m=1}^{M} c_m(i_m(x))\Big\|^2 \\
&= \sum_{m=1}^{M} \|x - c_m(i_m(x))\|^2 - (M-1)\|x\|^2 \\
&\quad + \sum_{a=1}^{M} \sum_{b=1, b \neq a}^{M} c_a(i_a(x))^\top c_b(i_b(x))
\end{aligned}
\qquad (3)
$$

The inner products in the third term do not depend on the input
vector and can be efficiently precomputed and stored, and the
second term can be omitted as it is a constant. The problem can
then be seen as a fully connected discrete pairwise MRF problem.
The optimization of E can be solved approximately by various
existing algorithms. Additive quantization proposed a beam search
algorithm in a matching pursuit fashion; the main idea is to
maintain the L best approximations, and the overall time
complexity of encoding an input vector is
$O(dMK + M^2 KL \log L)$. Such an encoding scheme can be very
time-consuming for large M. It also requires L to be quite large
to lower the quantization error as much as possible. Suppose the
best approximation (correct encoding) of an input vector is
$x \approx c_1(i_1) + c_2(i_2) + \cdots + c_M(i_M)$. Further
assume we already know the first m - 1 correct indices
$i_1, i_2, \cdots, i_{m-1}$; can we effectively compute $i_m$?
Denote the known part as
$\hat{x} = c_1(i_1) + \cdots + c_{m-1}(i_{m-1})$ and the unknown
part as $\tilde{x} = c_{m+1}(i_{m+1}) + \cdots + c_M(i_M)$; we
seek the correct index $i_m$ on the m-th dictionary. Notice that:

$$
\begin{aligned}
\|x - \hat{x} - c_m(i_m) - \tilde{x}\|^2
&= \|x - \hat{x}\|^2 + \|x - \tilde{x}\|^2 + 2\hat{x}^\top \tilde{x} \\
&\quad + \|x - c_m(i_m)\|^2 + 2\hat{x}^\top c_m(i_m) + 2c_m(i_m)^\top \tilde{x} \\
&\quad - 2\|x\|^2
\end{aligned}
\qquad (4)
$$

The first three terms can be seen as constants when we seek the
correct $i_m$, and the last term can be omitted. The fourth and
fifth terms can be effectively computed. However, the sixth term
cannot be computed because we do not know $\tilde{x}$. If we omit
this term, extra error is introduced. To lessen this error, we
hope that $\|\tilde{x}\|$ is very small, so that the variance of
this term does not have a serious impact on the final outcome.
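As a check on Equation (4), the expansion can be written out step by step. The shorthand below is introduced only for this sketch: a denotes the known part of the encoding (the sum of the first m - 1 selected elements), b the candidate element of the m-th dictionary, and c the unknown part (the sum of the remaining selected elements).

```latex
\begin{align*}
\|x - a - b - c\|^2
  &= \|x\|^2 + \|a\|^2 + \|b\|^2 + \|c\|^2
     - 2x^\top(a + b + c) + 2a^\top b + 2a^\top c + 2b^\top c \\
\|x - a\|^2 + \|x - c\|^2 + \|x - b\|^2
  &= 3\|x\|^2 + \|a\|^2 + \|b\|^2 + \|c\|^2 - 2x^\top(a + b + c) \\
\|x - a - b - c\|^2
  &= \|x - a\|^2 + \|x - c\|^2 + 2a^\top c
     + \|x - b\|^2 + 2a^\top b + 2b^\top c - 2\|x\|^2
\end{align*}
```

Subtracting the second line from the first gives the third. Only the fourth, fifth and sixth terms of the result depend on b, and of those only the sixth involves the unknown part c, which is why keeping the norm of the unknown part small limits the error made by dropping it.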

Thus we rearrange the dictionaries in descending order of norm:
$\sum \|C_m\|^2 > \sum \|C_{m+1}\|^2$, $m \in 1 \cdots M-1$. If
our method is initialized with dictionaries learned with RVQ, the
dictionaries naturally shrink. We further adopt beam search on
the scale-shrinking dictionaries; that is, we maintain a list of
the best L approximations of x on the first (m - 1) dictionaries:
$\{a_1^{m-1}, \cdots, a_L^{m-1}\}$. Then we encode with the next
dictionary $C_m = \{c_m(1), c_m(2), \cdots, c_m(K)\}$. We find L
combinations from $\{a_l^{m-1} + c_m(k)\}$, $l \in 1 \cdots L$,
$k \in 1 \cdots K$, by minimizing the following objective
function:

$$
\begin{aligned}
\|x - a_l^{m-1} - c_m(k)\|^2 &= \|x - a_l^{m-1}\|^2 + \|x - c_m(k)\|^2 \\
&\quad - \|x\|^2 + 2c_m(k)^\top a_l^{m-1}
\end{aligned}
$$

The first term has been computed at the previous encoding step,
and the third term $\|x\|^2$ is constant for any
$(a_l^{m-1} + c_m(k))$ and thus negligible. The last term
involves m table lookups and additions, with the inner products
of all dictionary elements precomputed before the beam search
procedure. Thus, only the term $\|x - c_m(k)\|^2$ needs to be
computed. The time complexity is $O(dK + MKL \log L)$ for
encoding with one single dictionary.

Figure 3: Convergence curve of dictionary annealing, initialized
by dictionaries learned via product quantization and residual
vector quantization methods on the GIST-1M dataset. M = 8
dictionaries are learned with K = 256 elements per dictionary.
For the encoding of DA, we use L = 10. The vertical axis
represents the quantization error and the horizontal axis
corresponds to the number of iterations. The curves are
Dictionary Annealing on dictionaries learned with Product
Quantization, padded with zeros (PQ-DA), Dictionary Annealing on
dictionaries learned with Residual Vector Quantization (RVQ-DA),
and Dictionary Annealing on dictionaries learned with Dictionary
Annealing Optimized Residual Vector Quantization (DARVQ-DA).

To sum up, our beam search iteratively uses the top L candidates
as seeds to find the best encoding for x, with the dictionaries
arranged in descending order of scale. Our proposed method is
quite similar to the multi-path search for residual trees [24].
The overall time complexity is $O(dMK + M^2 KL \log L)$. The
encoding time grows with M. See Table 1 for an empirical
comparison with other encoding methods. It can be seen that, at
comparable quantization error, DA is much faster.
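The beam search can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: it recomputes each candidate's error directly from the residual instead of using the precomputed inner-product tables described above, so it does not achieve the stated complexity. The function names and the toy dictionaries are assumptions.

```python
import heapq

def sq_norm(v):
    return sum(t * t for t in v)

def sort_by_norm(dicts):
    """Order dictionaries by total squared element norm, largest first."""
    return sorted(dicts, key=lambda book: sum(sq_norm(c) for c in book),
                  reverse=True)

def beam_encode(x, dicts, L):
    """Encode x one dictionary at a time, keeping the L best partial codes."""
    beam = [(sq_norm(x), list(x), [])]     # (error, residual, code so far)
    for book in dicts:
        candidates = []
        for _, res, code in beam:
            for k, c in enumerate(book):
                new_res = [r - ck for r, ck in zip(res, c)]
                candidates.append((sq_norm(new_res), new_res, code + [k]))
        beam = heapq.nsmallest(L, candidates, key=lambda t: t[0])
    best_err, _, best_code = beam[0]
    return best_code, best_err

# Toy setting (hypothetical) in which greedy encoding (L = 1) is suboptimal.
dicts = sort_by_norm([
    [[1.25, 0.0], [0.5, 0.5]],
    [[-0.25, 1.0], [0.125, 0.125]],
])
x = [1.0, 1.0]

greedy_code, greedy_err = beam_encode(x, dicts, L=1)
beam_code, beam_err = beam_encode(x, dicts, L=2)
```

With L = 1 the search commits to the locally best element of the first dictionary and ends with a nonzero error, while L = 2 keeps the second-best seed, which here leads to an exact reconstruction.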


5. DISCUSSION ABOUT THE IMPLEMENTATION DETAILS

5.1 Initialization with different methods 

Our proposed method DA can be used jointly with other vector
quantization methods, e.g. Additive Quantization, Product
Quantization, Optimized Product Quantization, Composite
Quantization, Residual Vector Quantization, etc. In addition, DA
can be used directly in the learning stages of RVQ: on each stage
of RVQ, first use DA to optimize the previously learned
dictionaries and encode the dataset, then perform k-means on the
residue. We call this method Dictionary Annealing Optimized
Residual Vector Quantization (DARVQ) in the following text.

The selection of the initializing dictionaries can have an impact
on the convergence speed and the final outcome. We compared
initializing dictionary annealing by PQ, by RVQ, and by DARVQ in
Figure 3, and we found that DARVQ-DA has the lowest quantization
error, followed by RVQ-DA. This is because the norms of the
dictionaries learned by residual vector quantization naturally
decrease, so the beam search in our proposed Dictionary Annealing
method performs much better. The norms of the dictionaries
learned by DARVQ shrink even faster, so Dictionary Annealing
performs even better.

Figure 4: Online training of dictionary annealing, initialized
with dictionaries learned by Dictionary Annealing Optimized
Residual Vector Quantization (DARVQ) on the GIST-1M dataset.
M = 8 dictionaries are learned with K = 256 elements per
dictionary. We divided the whole one million data points into 10
big batches to simulate online training. The vertical axis
represents the quantization error and the horizontal axis
corresponds to the number of batches fed to DA.

We can also observe that Dictionary Annealing reduces the
quantization error faster during the first M iterations. This is
because during the first M iterations the dictionaries are not
yet balanced or not yet independent of each other. After the
first M iterations, the dictionaries are balanced and mutually
independent, so the room for improvement is limited.

Here we suggest learning dictionaries with RVQ together with
optimization by DA, and using these learned dictionaries to
warm-start further optimization by DA. We used this
initialization in all the following experiments.


5.2 Scalability 

Dictionary Annealing can also be used for fitting online
datasets. For a large-scale dataset, performing k-means on all
the data can be prohibitive, and the size of the dataset may grow
with time. Our proposed dictionary annealing can adjust the
dictionaries to fit newly arriving data. Dictionary Annealing
gradually finds an optimal dictionary close to the original
dictionary, instead of discarding all previously learned
information.

Online dictionary learning is done simply by performing
dictionary annealing on the dataset in batches. We update the
dictionaries with large batches to prevent "misleading" the
optimization. The overall quantization error of the dataset is
further reduced by this online learning process; see Figure 4 for
a demonstration.

Online dictionary learning for sparse encoding has been proposed
in [28], while we focus on boosting the performance of ANN tasks.
Our online dictionary learning scheme largely prevents search
performance from degrading on very large datasets, and it is very
easy to implement.
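The batch-wise, warm-started flavor of this scheme can be illustrated with a deliberately simplified stand-in. The sketch below refines a single dictionary with plain k-means (Lloyd) updates on each incoming batch, starting from the dictionary learned so far. It is not Dictionary Annealing itself: the residue, the intermediate dataset and the subspace schedule are all omitted, and every name in it is hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_step(batch, book):
    """One k-means update of `book` on `batch`; unused elements are kept."""
    sums = [[0.0] * len(book[0]) for _ in book]
    counts = [0] * len(book)
    for x in batch:
        k = min(range(len(book)), key=lambda j: sq_dist(x, book[j]))
        counts[k] += 1
        sums[k] = [s + v for s, v in zip(sums[k], x)]
    return [[s / counts[k] for s in sums[k]] if counts[k] else list(book[k])
            for k in range(len(book))]

def quant_error(batch, book):
    """Sum of squared distances from each point to its nearest element."""
    return sum(min(sq_dist(x, c) for c in book) for x in batch)

def online_update(book, batches, steps=5):
    """Refine the existing dictionary on each batch instead of relearning it."""
    for batch in batches:
        for _ in range(steps):
            book = lloyd_step(batch, book)
    return book

# Toy 1-d example (hypothetical): a dictionary learned earlier meets new data.
book0 = [[0.0], [1.0]]
batch = [[0.0], [0.5], [4.0], [5.0]]
book1 = online_update(book0, [batch])
err_before = quant_error(batch, book0)
err_after = quant_error(batch, book1)
```

Because each update starts from the current dictionary, a batch that is too small or unrepresentative can pull the elements away from the rest of the data, which is consistent with the recommendation above to use large batches.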



















Table 2: Computing time (in seconds) for different quantization
methods on the GIST-1M and SIFT-1M datasets, with M = 8, K = 256
(64-bit encoding). All methods are GPU accelerated for a fair
comparison. We used 100K samples for training, encoded the whole
dataset, and finally performed 1000 queries. DA(online) fitted
the whole dataset in big batches (100,000 samples per batch). DA
encodes the dataset by maintaining the L = 10 best
approximations, and for AQ encoding L = 32. The results show that
the speeds of the different methods vary a lot. The degree of
parallelism and cache friendliness affect the speed of these
algorithms as well as the time complexity does.

Dataset    Method        Training    Encoding    Query
GIST-1M    DA(online)    778.31      78.34       5.315
GIST-1M    DA(offline)   94.46       74.65       5.192
GIST-1M    AQ            414.69      392.33      5.224
GIST-1M    PQ            8.35        3.12        5.001
GIST-1M    OPQ           254.67      15.56       5.130
GIST-1M    RVQ           62.20       25.31       5.282
SIFT-1M    DA(online)    527.46      67.87       5.231
SIFT-1M    DA(offline)   64.56       64.56       5.282
SIFT-1M    AQ            339.17      319.17      5.149
SIFT-1M    PQ            5.46        1.73        4.993
SIFT-1M    OPQ           95.84       2.48        5.162
SIFT-1M    RVQ           22.19       11.51       5.295



5.3 Acceleration with the latest computation technologies

Our proposed dictionary annealing can easily be accelerated by
the latest computation technologies. There is no branching in
dictionary annealing, so implementation on GPU is quite easy and
gives a significant speed boost.

For the dictionary optimization procedure, the major computation
involves k-means and our proposed encoding scheme. The k-means
algorithm is very easily implemented on GPUs and multi-core
systems, and with the latest instruction sets such as
AVX/AVX2 [38]. For the encoding procedure, our proposed encoding
scheme requires enumerating the L best approximations of an input
vector from a list of KL combinations, which requires intensive
memory operations and is less GPU-friendly. However, compared to
the encoding method of Additive Quantization, our proposed
encoding scheme requires fewer best approximations to be
enumerated from a shorter list of total combinations. So our
proposed encoding method is still much faster.

We have implemented our dictionary annealing method 
with MATLAB, we have also used GPU acceleration, so the 
entire experiments below can be done rather fast. We also 
implemented other quantization methods on GPU. We re¬ 
ported the running time of experiments done in Sectionj^for 
different methods on Table [2] and Table [3] On the dataset 
preparation, the majority of the time spent with DA is on the 
encoding stages, as well as AQ. Since our encoding method 
is faster than AQ, our approaches run much faster than AQ. 
Gompared to OPQ, which has a significant speed loss on 
very high dimensions (mainly due to the time costly SVD 
decomposition), our proposed method can handle very high 


Table 3: Computing time for learning, encoding and
searching with 128-bit encoding on the SIFT-1M and
GIST-1M datasets for different methods.

Dataset   Method        Training    Encoding    Query
GIST-1M   DA(online)    3109.43s    200.03s     9.479s
GIST-1M   DA(offline)   379.15s     197.65s     9.415s
GIST-1M   AQ            1225.02s    1131.96s    9.582s
GIST-1M   PQ            20.35s      3.45s       9.218s
GIST-1M   OPQ           333.26s     15.08s      9.408s
GIST-1M   RVQ           116.13s     40.99s      9.563s
SIFT-1M   DA(online)    3242.32s    178.18s     9.484s
SIFT-1M   DA(offline)   318.97s     185.75s     9.475s
SIFT-1M   AQ            1206.88s    1078.27s    9.416s
SIFT-1M   PQ            10.74s      1.97s       9.270s
SIFT-1M   OPQ           176.81s     2.66s       9.300s
SIFT-1M   RVQ           43.20s      25.31s      9.432s

dimensional data easily. For the query time, although AQ,
RVQ and our DA require an additional correction term to
compute the approximated distance, this does not actually
affect the query time: on modern memory devices, copying a
memory chunk takes almost the same time as resetting it.
The slight difference in query time is due to
pre-computation: AQ/RVQ/DA require O(MKd) time to compute
the distances between the dictionary elements and the
query, while PQ/OPQ require only O(Kd). OPQ additionally
requires a vector rotation, which takes O(d^2) time.
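The pre-computation costs above can be illustrated with a
small sketch. It is plain Python for illustration only (the
implementation described here is in MATLAB), and all function
names are hypothetical. It also assumes that the additional
correction term for AQ/RVQ/DA is a stored squared norm of the
approximation, which this section does not spell out.

```python
def additive_tables(query, dictionaries):
    """Inner products <q, c> for all M*K elements of full-dimensional
    dictionaries: M * K * d multiply-adds, i.e. O(MKd)."""
    return [[sum(q * c for q, c in zip(query, elem)) for elem in dic]
            for dic in dictionaries]


def pq_tables(query, sub_dictionaries):
    """Squared distances between each query sub-vector and the K elements
    of the matching sub-dictionary: M * K * (d/M) operations, i.e. O(Kd)."""
    sub_dim = len(query) // len(sub_dictionaries)
    tables = []
    for m, dic in enumerate(sub_dictionaries):
        q_sub = query[m * sub_dim:(m + 1) * sub_dim]
        tables.append([sum((q - c) ** 2 for q, c in zip(q_sub, elem))
                       for elem in dic])
    return tables


def additive_distance(query, code, tables, approx_sq_norm):
    """||q - x_hat||^2 = ||q||^2 - 2 * sum_m <q, c_m> + ||x_hat||^2.
    approx_sq_norm is the assumed stored correction term ||x_hat||^2."""
    q_sq = sum(q * q for q in query)
    dot = sum(tables[m][k] for m, k in enumerate(code))
    return q_sq - 2.0 * dot + approx_sq_norm


def pq_distance(code, tables):
    """PQ needs no correction: the per-subspace distances simply add up."""
    return sum(tables[m][k] for m, k in enumerate(code))
```

In both cases the per-item work during the scan is M table
look-ups plus a constant number of additions, which is
consistent with the nearly identical query times reported
above; only the one-off table construction differs.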


6. PERFORMANCE ON ANN TASKS 

In this section we report the performance on ANN tasks of
dictionaries optimized by Dictionary Annealing, and compare
it to other state-of-the-art methods.

6.1 Datasets 

We performed the ANN search tests on two datasets
commonly used to validate the efficiency of ANN methods,
SIFT-1M and GIST-1M from [21]:

SIFT-1M contains one million 128-d SIFT features, a
local feature descriptor commonly used in various
image-related applications.

GIST-1M contains one million 960-d GIST global
descriptors.


For each dataset, we randomly pick 100,000 vectors as the
learning set. We then encode the rest of the database
vectors, and perform 1000 queries to check ANN search quality.


6.2 Evaluated Methods 

We compared our DA to the following state-of-the-art
quantization methods:

PQ : Product Quantization. Following [21], we used the
structured ordering for GIST-1M and the natural ordering
for SIFT-1M.


OPQ : Optimized Product Quantization proposed in [16].
We adopted the non-parametric version of OPQ. Cartesian
k-means, the algorithm proposed in [30], shares a similar
idea and has the same performance as OPQ.

















[Figure 5, two panels: Recall@R on SIFT-1M, 64bit; Recall@R
on GIST-1M, 64bit. Curves: DA-online, DA-offline, RVQ, AQ,
OPQ, PQ.]

Figure 5: The performance for different algorithms on
SIFT-1M and GIST-1M, with 64 bits encoding (M = 8, K = 256).


AQ : Additive Quantization. Another similar algorithm
is Composite Quantization, which introduces a constraint
named constant inter-dictionary-element-product on the
encoding of the vectors to avoid computing a "bias" term
in the asymmetric distance computation.

RVQ : Residual Vector Quantization proposed in [10].

For all the methods, we choose K = 256 as the size of each
dictionary, because it results in a small look-up table and
each sub-index fits into one byte, which is instruction- and
cache-friendly on modern CPUs/GPUs. We choose M = 8/16 to
encode short codes for the dataset, resulting in
64-bit/128-bit encodings. Such encodings greatly compress
the original dataset. For SIFT local descriptors, the
original vector consists of 128 floating point numbers,
which takes 128*32 bits of space, so 64-bit codes achieve a
1/64 compression ratio. For GIST global descriptors the
compression ratio is even lower, at 1/480. An in-memory
exhaustive search over these datasets is therefore feasible.
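The compression ratios quoted above follow from simple
arithmetic, sketched below in Python. The helper name is
hypothetical, and 32-bit floating point components are assumed
as stated in the text; the 1/64 and 1/480 figures correspond to
the 64-bit codes (M = 8).

```python
def compression_factor(dim, code_bits, bits_per_component=32):
    """Size of the original vector in bits divided by the code size in bits."""
    return dim * bits_per_component / code_bits


if __name__ == "__main__":
    for name, dim in (("SIFT", 128), ("GIST", 960)):
        for m in (8, 16):
            code_bits = m * 8  # M sub-indices, log2(256) = 8 bits each
            factor = compression_factor(dim, code_bits)
            print(f"{name}, {code_bits}-bit code: 1/{factor:g}")
```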

For our DA methods, we conducted M = 8/16 iterations. We
used the dictionary obtained by DARVQ to initialize
Dictionary Annealing. When optimizing with the intermediate
dataset, we let the dimensions grow exponentially to d (the
original dimension of the input dataset) in 5 iterations.
In addition, we conducted online dictionary optimization
with Dictionary Annealing to learn a dictionary that better
fits the whole dataset. Data are fed to DA in batches of
100K samples.

To find ANNs, we perform a linear scan with asymmetric
distance computation (ADC), which directly compares the
input query with the quantized dataset. The search quality
is measured using recall@R: for each query, we retrieve the
R nearest items and check whether they contain the true
nearest neighbor. This criterion is commonly used to
evaluate ANN methods.
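The recall@R criterion described above can be sketched as
follows. This is an illustrative Python helper with
hypothetical names, assuming each query comes with a ranked
list of retrieved item ids and the id of its true nearest
neighbor.

```python
def recall_at_r(ranked_ids, true_nn_ids, r):
    """Fraction of queries whose true nearest neighbor appears among
    the first r retrieved items."""
    hits = sum(1 for ranked, true_id in zip(ranked_ids, true_nn_ids)
               if true_id in ranked[:r])
    return hits / len(true_nn_ids)


if __name__ == "__main__":
    # Three toy queries, each with a ranked list of retrieved ids.
    ranked = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
    truth = [1, 0, 1]
    for r in (1, 2, 3):
        print(r, recall_at_r(ranked, truth, r))
```

Recall@R is non-decreasing in R by construction, so the
curves for different methods are compared at equal R.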

Since quantization-based ANN search methods outperform
hashing-based binary embedding techniques [21], [30], we do
not present results for hashing methods in our tests.


6.3 Results 

Table 4: Squared quantization error ($E = \|x - \hat{x}\|^2$)
on the GIST-1M dataset for different quantization methods,
with M = 8/16, K = 256. DA encodes with L = 10, and
AQ encodes with L = 32.

Method        64bit       128bit
DA(online)    0.609222    0.464948
DA(offline)   0.637022    0.492671
AQ            0.679694    0.521014
PQ            0.742063    0.606044
OPQ           0.680419    0.531976
RVQ           0.727788    0.618995


Table 5: Squared quantization error on the SIFT-1M
dataset for different quantization methods, with
M = 8/16, K = 256. DA encodes with L = 10, and AQ
encodes with L = 32.

Method        64bit       128bit
DA(online)    16479.11    7858.23
DA(offline)   17648.08    9148.75
AQ            19032.97    9176.82
PQ            23106.71    10332.61
OPQ           21183.56    9831.85
RVQ           20067.97    9901.05


The recall curves compare the different methods on 64-bit
and 128-bit codes (Figure 5 shows the 64-bit case). One can
see that our DARVQ-DA optimized dictionaries offer
significant improvements over the original RVQ dictionaries.
For example, on 64-bit encoding, with DARVQ-DA we obtained
31.8% recall@1 on SIFT-1M, while RVQ reaches only 25.4%, a
relative improvement of 25.2%. The gain is even larger on
the higher-dimensional GIST-1M data, where we obtained
14.6% recall@1 with DARVQ-DA and only 9.3% with RVQ, a
relative improvement of 56.9%.
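The relative improvements quoted above are computed as
(new - baseline) / baseline. The short Python check below uses
a hypothetical helper name and the rounded recall values from
the text, so it matches the reported percentages only to
within about 0.1 percentage point.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    # recall@1 values (in percent) quoted in the text for 64-bit codes
    print("SIFT-1M:", round(relative_improvement(31.8, 25.4), 1))
    print("GIST-1M:", round(relative_improvement(14.6, 9.3), 1))
```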

Using our DA optimized dictionaries for ANN tasks also
outperforms other state-of-the-art methods. The offline DA
optimized dictionaries with DARVQ outperform Additive
Quantization by 11.8%, and the online version outperforms
Additive Quantization by 16%, in terms of recall@1 with
64-bit encoding on SIFT-1M. With 128-bit encoding, AQ and
DA-offline perform similarly; we speculate that AQ has
already found near-optimal dictionaries for the learning
dataset, so the improvement is limited (DA-online learns a
better dictionary with all the data). In general, DA
optimized dictionaries have the best performance, with a
noticeable advantage, because Dictionary Annealing achieves
a lower quantization error. The quantization errors of the
different methods are reported in Table 4 and Table 5.
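As a reminder of what the quantization error measures, the
sketch below reconstructs each vector as the sum of the
dictionary elements selected by its code and averages the
squared residual norm. It is illustrative Python with
hypothetical names; averaging over the dataset is an
assumption, since the aggregation is not stated here.

```python
def reconstruct(code, dictionaries):
    """Sum of the selected element from each of the M dictionaries."""
    dim = len(dictionaries[0][0])
    return [sum(dictionaries[m][k][j] for m, k in enumerate(code))
            for j in range(dim)]


def mean_squared_quantization_error(vectors, codes, dictionaries):
    """Average of ||x - x_hat||^2 over all encoded vectors
    (the averaging is an assumption of this sketch)."""
    total = 0.0
    for x, code in zip(vectors, codes):
        x_hat = reconstruct(code, dictionaries)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(vectors)
```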

7. CONCLUSION AND FUTURE WORKS 

In this paper, we introduced the Dictionary Annealing method
for optimizing dictionaries used by quantization-based
approximate nearest neighbor search methods. We first
discussed what makes a good encoding: high inter-dictionary
independence and high inner-dictionary information entropy.
We observed that residual vector quantization easily
produces independent dictionaries, and that clustering on
subspaces generates a balanced partition. Motivated by these
observations, we use residual vectors to increase the
independence of dictionaries, and perform warm-started
k-means with clusters on subspaces to learn better
dictionaries. We also used an optimized multi-path encoding
method to aid the dictionary annealing procedure. Dictionary
Annealing can significantly improve the dictionaries learned
by other methods, especially those learned by residual
vector quantization. Empirical results on the SIFT-1M and
GIST-1M datasets, commonly used for evaluating ANN search
methods, demonstrate that our proposed approach outperforms
existing methods.

Our major contribution is to show that optimizing
dictionaries with residues can bring a significant
performance gain without heavily modifying the original
framework, and that optimizing dictionaries online can bring
even larger gains. Currently, the main limitation of the
proposed scheme is the speed of encoding. With more
dictionaries, our proposed method has to deal with the
growing variance of the inner products between
inter-dictionary elements. Our future work will focus on
eliminating this variance, so that further performance
gains become possible.